Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, matched and distinguished, ER is known to be a challenging task for automated strategies, but comparatively easy for expert humans. In our work, we abstract the knowledge of experts with the notion of a binary oracle. Our oracle can answer questions of the form "do records u and v refer to the same entity?" under a flexible error model, which allows some questions to be more difficult to answer correctly than others. Our contribution is a general error correction tool that can be leveraged by a variety of hybrid human-machine ER algorithms, based on a formal way of selecting indirect "control queries". In our experiments we demonstrate that correction-less ER algorithms equipped with our tool can perform even better than recent ER algorithms specifically designed for correcting errors. Our control queries are selected among those that provide the strongest connectivity between records of each cluster, based on the concept of graph expanders (which are sparse graphs with formal connectivity properties). We give formal performance guarantees for our toolkit and provide experiments on real and synthetic data.
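To make the idea of a noisy oracle and control queries concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes a uniform error rate (the paper's error model is more flexible), assumes that a repeated question gets the same answer, and uses uniformly random control queries as a simple stand-in for expander-based selection. The names `make_noisy_oracle` and `belongs_to_cluster` are hypothetical.

```python
import random


def make_noisy_oracle(truth, error_rate, rng):
    """Build an oracle for 'do records u and v refer to the same entity?'.

    truth maps each record to its real entity id. Each distinct question is
    answered wrongly with probability error_rate (an assumed, uniform error
    model). Answers are cached, so asking the same question twice does not
    help: this is an assumption of the sketch.
    """
    cache = {}

    def oracle(u, v):
        key = tuple(sorted((u, v)))
        if key not in cache:
            correct = truth[u] == truth[v]
            flip = rng.random() < error_rate
            cache[key] = (not correct) if flip else correct
        return cache[key]

    return oracle


def belongs_to_cluster(record, cluster, oracle, num_controls, rng):
    """Decide membership by majority vote over several control queries.

    Instead of trusting a single answer, compare the record with up to
    num_controls randomly chosen members of the cluster. Random choice is
    only a placeholder for a principled, connectivity-based selection.
    """
    members = list(cluster)
    controls = rng.sample(members, min(num_controls, len(members)))
    yes_votes = sum(1 for m in controls if oracle(record, m))
    return 2 * yes_votes > len(controls)
```

With a moderate error rate, a single wrong answer can be outvoted by the other control queries, which is the intuition behind correcting errors through indirect comparisons; how many controls are needed, and which ones to pick, is what the formal analysis in the paper addresses.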
Robust entity resolution using random graphs / Galhotra, Sainyam; Firmani, Donatella; Saha, Barna; Srivastava, Divesh. - (2018), pp. 3-18. (Paper presented at the 44th ACM SIGMOD International Conference on Management of Data (SIGMOD), Winner of the REPRODUCIBILITY AWARD, Class A++ (GII-GRIN rating), held in Houston, TX, USA) [10.1145/3183713.3183755].
Robust entity resolution using random graphs
Firmani Donatella;
2018